\documentclass{llncs}

\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage[T1]{fontenc}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{amsmath}
\usepackage{url}
\usepackage{lmodern}
\usepackage{nonfloat}

\begin{document}

\title{Automatic Authorship Attribution\\ for Texts in Croatian Language\\ Using
Combinations of Features}
\author{Tomislav Reicher \and Ivan Krišto \and Igor Belša \and Artur Šilić}
\institute{Faculty of Electrical and Computing Engineering\\
University of Zagreb\\
Unska 3, 10000 Zagreb, Croatia\\
\email{\{tomislav.reicher, ivan.kristo, igor.belsa, artur.silic\}@fer.hr}}

\maketitle
\vspace{-10pt}
\begin{abstract}
In this work we investigate the use of various character-, lexical-, and
syntactic-level features and their combinations in automatic authorship
attribution. Since the majority of text representation features are language
specific, we examine their application to texts written in the Croatian
language. Our work differs from similar work in at least three aspects.
Firstly, we use a slightly different set of features than previously proposed.
Secondly, we use four different data sets and compare the same features across
them to draw stronger conclusions. These data sets consist of articles, blogs, books, and forum posts written in Croatian. Finally, we employ a classification method based on a strong classifier. We use Support Vector Machines to learn classifiers that achieve excellent results for longer texts: 91\% accuracy and $F_1$ measure for blogs, 93\% for articles, and 99\% for books. Experiments conducted on forum posts show that more complex features need to be employed for shorter texts.

\textbf{Key words:} authorship attribution, function words, POS \emph{n}-grams,
feature combinations, SVM.
\end{abstract}

\vspace{-20pt}
\section{Introduction}
Automatic authorship attribution is a task in the field of text classification
that deals with identifying the author of a given text. It can be interpreted
as a problem of text classification based on linguistic features specific to certain
authors. The main concern in computer-based authorship attribution is defining
an appropriate characterization of the text, one that captures
the writing style of the authors \cite{coyotl2006authorship}.

Authorship attribution can help in document indexing, document filtering, and
hierarchical categorization of web pages \cite{luyckx2005shallow}. These
applications are common in the field of information retrieval. It must be noted
that authorship attribution differs from plagiarism detection. Plagiarism
detection attempts to detect similarities between two substantially different
pieces of work; however, it is unable to determine whether they were produced by the
same author \cite{de2001mining}.

The problem of authorship attribution can be divided into three categories
\cite{zhao2005effective}: binary, multi-class, and single-class (or one-class)
classification. Binary classification solves the problem in which the data set
contains texts written by one of two authors. Multi-class classification is a
generalization of binary classification to more than two authors
in the data set. One-class classification is applied when only some of the texts
in the data set are written by a particular author while the authorship of all
the other texts is unspecified; it ascertains whether a given
text belongs to a single known author or not.

This paper presents a study of multi-class classification for texts written
in the Croatian language. The work focuses on the combination and evaluation
of different text representation features on different data sets. The rest of
the paper is organized as follows. Section 2 discusses related work on authorship attribution
and similar problems. Section 3 introduces the different types of text representation
features that we have utilized. Section 4 describes the classification method, Section 5
describes the data sets used, and Section 6 presents the evaluation methods and
experimental results. Conclusions and future work are given in Section 7.

\section{Related Work}
There are several approaches to authorship attribution with respect to the text
representation features used for classification. Based on those features, the
following taxonomy can be made \cite{stamatatos2009survey}: \emph{character}, \emph{lexical}, \emph{syntactic}, \emph{semantic}, and \emph{application-specific} features. Feature choice for topic classification of Croatian texts was investigated in \cite{malenica,silic}; those methods use character and lexical features. The following paragraphs
describe character, lexical, and syntactic features in more depth and relate our
work to the existing research.

\emph{Character features} are the simplest text representation features. They
consider text as a mere sequence of characters and are thereby usable for any
natural language or corpus. Various measures can be defined, such as
character frequencies, digit frequencies, uppercase and lowercase character
frequencies, punctuation mark frequencies, etc.\ \cite{de2001mining}. Another
type of character-based feature, which has proven quite successful
\cite{peng2003language,stamatatos2006ensemble}, is the frequency of character
\emph{n}-grams.

Text representation using \emph{lexical features} is characterized by dividing
the text into a sequence of tokens (words) that group into sentences. Features
directly derived from that representation are word length, sentence length, and
vocabulary richness. These types of features have been used in
\cite{holmes1994authorship,mendenhall1887}. The results achieved demonstrate that
they are not sufficient for the task, mostly due to their
significant dependence on text genre and length. However, taking advantage of
features based on the frequencies of different words, especially function words,
can produce considerably better results
\cite{argamon2005measuring,koppel2003exploiting,uzuner2005comparative,zhao2005effective}.
Analogous to character \emph{n}-grams, word \emph{n}-gram features
can be defined; they have been shown to be quite successful as well
\cite{coyotl2006authorship,keselj2003n}.

The use of \emph{syntactic features} is governed by the idea that authors tend to
unconsciously use similar syntactic patterns. Information related to the
structure of the language is obtained by an in-depth syntactic analysis of the
text, usually using some sort of NLP tool. A single text is characterized by
the presence and frequency of certain syntactic structures. Syntax-based features
were introduced in \cite{van1996outside}, where rewrite rule frequencies
were utilized. Stamatatos et al.\ \cite{stamatatos2001computer} used noun, verb,
and prepositional phrase frequencies. Using a part-of-speech (POS) tagger, one can
obtain POS tag and POS tag \emph{n}-gram frequencies; such features
can achieve excellent results
\cite{diederich2003authorship,koppel2003exploiting,kukushkina2001using,luyckx2005shallow}.
Koppel et al.\ \cite{koppel2003exploiting} show that the use of grammatical
errors and informal styling (e.g., writing sentences in capitals) as text features
can be useful in authorship attribution.

Our work is based on the composition and evaluation of various aforementioned
text representation features. We use different character, lexical, and syntactic
features and adapt them for use with the Croatian language. We use
punctuation mark and vowel frequencies as character features. Word length,
sentence length, and function word frequencies are used as lexical features. As
syntax-based features, we use features closely related to POS tag and POS tag
\emph{n}-gram frequencies.


\section{Text Representation}
When constructing an authorship attribution system, the central issue is the
selection of sufficiently discriminative features. A feature is discriminative if
it is common for one author and rare for all the others. Due to the large number
of authors some complex features are very useful if their distribution is
specific to each author. Moreover, as the texts from dataset greatly differ in
length and topic, it is necessary to use the features independent of such
variations. If the features were not independent of such variation that
would most certainly reduce generality of system's application and could lead
to a decrease of accuracy (e.g., relating author to concrete topic or terms). In the following subsections we will describe different features used.

\subsection{Function Words}
\label{sec:funkcijske-rijeci}
Function words, such as adverbs, prepositions, conjunctions, or interjections,
are words that have little or no semantic content of their own. They usually
indicate a grammatical relationship or a generic property
\cite{zhao2005effective}. Although one would assume that the frequencies of some of
the less used function words would be a useful indicator of an author's style, even the
frequencies of more common function words can adequately distinguish between
authors. Due to the high frequency of function words and their significant
roles in the grammar, an author usually has no conscious control over their
usage in a particular text \cite{argamon2005measuring}. They are also
topic-independent. Therefore, function words are good indicators of an author's
style.

It is difficult to predict whether these words will give equally good results for
different languages. Moreover, despite the abundance of research in this field,
due to various languages, types and sizes of the texts, it is currently
impossible to conclude if these features are generally effective
\cite{zhao2005effective}.

In addition to function words, in this work we also consider the frequencies of
auxiliary verbs and pronouns, as they might likewise be
representative of an author's style. This gives a set of
652 function words in total.
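
As an illustration, a minimal sketch of this feature extraction is given below;
the tokenization and the 652-word list are assumed to be available and are not
reproduced here.
\begin{verbatim}
# Sketch: normalized function word frequencies. function_words
# stands for the 652-word list (an assumption; the actual list
# is not reproduced here).
def function_word_features(tokens, function_words):
    counts = {w: 0 for w in function_words}
    for token in tokens:
        t = token.lower()
        if t in counts:
            counts[t] += 1
    n = max(len(tokens), 1)
    # Normalize by the total number of words in the text.
    return [counts[w] / n for w in function_words]
\end{verbatim}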

\subsection{Idf Weighted Function Words}
\label{sec:funkcijske-rijeci-idf}

The use of features based on function word frequencies often raises the
problem of determining how important, in terms of discrimination, a
given function word is \cite{diederich2003authorship}.

To cope with this problem, we used a combination of $L_p$ length normalization
and a transformation of the function word occurrence frequencies, in
particular the \emph{idf} (inverse document frequency) measure.

The \emph{idf} measure is defined as \cite{diederich2003authorship}
\begin{equation}
F_{idf}(t_k) = \log \frac{n_d}{n_d(t_k)},
\label{equ:idf}
\end{equation}
where $n_d(t_k)$ is the number of texts from the learning data set that contain the word
$t_k$ and $n_d$ is the total number of texts in that learning data set. This
measure gives high values for words that appear in a small number of
texts and are thus highly discriminative.

Since the \emph{idf} measure uses only the information about the presence of a
function word in the texts, ignoring the frequency of that word, a word that appears
many times in a single text and once in all the others gets the same value as
one that appears once in every text. Therefore, it is necessary to
multiply the obtained \emph{idf} measure of a given function word by the
occurrence frequency of that word in the observed text.
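
A minimal sketch of this weighting scheme, following Eq.~(\ref{equ:idf}), might
look as follows; the $L_p$ length normalization is omitted here for brevity.
\begin{verbatim}
import math

# Sketch: idf weights estimated from the training texts (each
# a list of tokens), following Eq. (1).
def idf_weights(train_texts, function_words):
    n_d = len(train_texts)
    weights = []
    for w in function_words:
        n_d_t = sum(1 for text in train_texts if w in text)
        # Words absent from the training set get weight 0 here
        # (a simplifying assumption; other choices exist).
        weights.append(math.log(n_d / n_d_t) if n_d_t else 0.0)
    return weights

# Each raw frequency is multiplied by the idf of its word.
def idf_weighted_features(freqs, weights):
    return [f * w for f, w in zip(freqs, weights)]
\end{verbatim}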

\subsection{Part-Of-Speech}
\label{sec:rijeci-grupe}
The next three features we use are morphosyntactic features. To obtain them,
some sort of NLP tool was required. The Croatian language is
morphologically complex, and it is difficult to develop an accurate and robust POS
or MSD (morphosyntactic description) tagger. We utilized the method given in
\cite{snajder08automatic}, which uses an inflectional morphological lexicon to obtain
the POS and MSD of each word in the text. However, the problem of homography remained. Homographs are identically spelled words with different meanings in context, and consequently with different MSDs. A large percentage of
word forms is ambiguous, and with the given method we cannot disambiguate
homographs, so all possible POS and MSD tags for such words are considered.

The simplest morphosyntactic features we use are based on POS frequencies, similar to those used in \cite{kukushkina2001using}. The given text is
preprocessed and the POS of each word in the text is determined. Features are
obtained by counting the number of occurrences of different POS tags and
normalizing by the total number of words in the text. The POS tags used are
adpositions, conjunctions, interjections, nouns, verbs,
adjectives, and pronouns. In addition, the category ``unknown'' is introduced for all
the words not found in the lexicon.
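
For illustration, assuming a tagger that returns the set of possible POS tags
for each word (with ``unknown'' for out-of-lexicon words), a sketch of this
feature could be:
\begin{verbatim}
POS_TAGS = ["adposition", "conjunction", "interjection",
            "noun", "verb", "adjective", "pronoun", "unknown"]

# Sketch: POS frequencies. tagged_text is a list of tag sets,
# one per word; ambiguous words keep all their possible tags
# (how such words contribute to the counts is an assumption).
def pos_features(tagged_text):
    counts = dict.fromkeys(POS_TAGS, 0)
    for tags in tagged_text:
        for tag in tags:
            counts[tag] += 1
    n = max(len(tagged_text), 1)
    # Normalize by the total number of words in the text.
    return [counts[t] / n for t in POS_TAGS]
\end{verbatim}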


\subsection{Word Morphological Category}
\label{sec:morphosyntactic}

The more complex morphosyntactic features that we use take advantage of the words' morphological categories. Each word can be described by a set of morphological
categories depending on the word's POS: \emph{case}, \emph{gender}, and \emph{number}
for nouns; \emph{form}, \emph{gender}, \emph{number}, and \emph{person} for verbs; and
\emph{case}, \emph{degree}, \emph{gender}, and \emph{number} for adjectives.
Moreover, each morphological category can take one of a number of different
values; for example, a noun can be in the nominative, genitive, dative, accusative, vocative, locative, or instrumental case. The features that we use are obtained by counting the number of occurrences of the different values of each morphological category and then normalizing them by the total number of words in the text. If, for example, a sentence consists of two nouns and an adjective in the nominative case, then the number of occurrences of the nominative case equals three and the normalized nominative case frequency equals one.

\subsection{Part-Of-Speech \emph{n}-grams}
\label{sec:ngrami-tipova}
%POS \emph{n}-grams frequency-based features are features that utilize the idea of word \emph{n}-grams applied to POS of words in the text.
POS \emph{n}-grams are similar to word \emph{n}-grams. Instead of using the actual words found in a text, all words are replaced by their POS to make a new text representation, which is then used to count the number of occurrences of different POS \emph{n}-grams. The number of occurrences of every single POS \emph{n}-gram is normalized by the total number of \emph{n}-grams to make this feature independent of the text length. The POS tags we use are those given in Subsection \ref{sec:rijeci-grupe}.
Since POS \emph{n}-gram features can lead to very large dimensionality, only
3-grams are considered. An example of a POS \emph{n}-gram for the word 3-gram ``Adam i
Eva'' (``Adam and Eve'') is the ``noun conjunction noun'' trigram.
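
A minimal sketch of this first method is given below; for simplicity, a single
POS tag per word is assumed, so with the 8 tags above there are at most
$8^3 = 512$ features, matching Table~\ref{tbl:eval}.
\begin{verbatim}
from collections import Counter

# Sketch: POS 3-gram frequencies. pos_text is the text with
# each word replaced by one POS tag, e.g. ["noun",
# "conjunction", "noun"] for "Adam i Eva".
def pos_trigram_features(pos_text):
    trigrams = Counter(zip(pos_text, pos_text[1:],
                           pos_text[2:]))
    total = max(len(pos_text) - 2, 1)
    # Normalize by the total number of 3-grams in the text.
    return {tg: c / total for tg, c in trigrams.items()}
\end{verbatim}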

Further, we investigate the use of \emph{n}-grams over POS tags and function words in
parallel. The nouns, verbs, and adjectives in a given text are replaced by their
POS. All the other words, which are considered to be function words, are
left as they are. Therefore, the 3-gram ``Adam i Eva'' transforms into the ``noun i
noun'' 3-gram. Frequencies of such \emph{n}-grams are then used as features.
The use of such features is motivated by the idea of capturing the contextual
information of function words. Since the many different pronouns, conjunctions,
interjections, and prepositions yield many different \emph{n}-grams, frequency
filtering is applied -- only the frequencies of the 500 most frequent 3-grams in the
training data set are considered. The dimension reduction method used is not optimal,
so in future work other methods should be evaluated, such as
information gain, the $\chi^2$ test, mutual information, maximum relevance
minimum redundancy, etc.
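
A sketch of this second method, including the top-500 frequency filtering,
is given below; \texttt{pos\_of} is a hypothetical single-tag lookup.
\begin{verbatim}
from collections import Counter

# Sketch: second n-gram method. Content words are replaced by
# their POS; function words are kept. pos_of is a hypothetical
# lookup returning one POS tag per word form.
CONTENT_POS = {"noun", "verb", "adjective"}

def mixed_tokens(tokens, pos_of):
    return [pos_of(t) if pos_of(t) in CONTENT_POS
            else t.lower() for t in tokens]

def top_trigrams(train_texts, pos_of, k=500):
    counts = Counter()
    for tokens in train_texts:
        mixed = mixed_tokens(tokens, pos_of)
        counts.update(zip(mixed, mixed[1:], mixed[2:]))
    # Keep only the k most frequent 3-grams of the training
    # set; their normalized frequencies become the features.
    return [tg for tg, _ in counts.most_common(k)]
\end{verbatim}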

\subsection{Other Features}
\label{sec:znacajke-manje}
The other features that we use are simple character and lexical features:
punctuation mark, vowel, word length, and sentence length frequencies.

The following set of punctuation marks is used: {``.''}, {``,''}, {``!''}, {``?''}, {``{'}''}, {``"''}, {``-''}, {``:''}, {``;''}, {``+''}, and {``*''}. Their appearances in the text are counted and
the result is normalized by the total number of characters in the text.

Features based on the frequencies of vowel occurrences (a, e, i, o, u) are
obtained in the same manner.

%The frequencies of words lengths are obtained by counting the lengths of all the words from a text and then normalizing them by the total number of words in the given text.

The frequencies of word lengths are obtained by counting the occurrences of words of each possible length; the counts are normalized by the total number of words in the given text. To increase robustness, words longer than 10 characters are counted as if they were 10 characters long.

Sentence length frequencies are obtained by a similar procedure.
For the same reason as with word lengths, sentences longer than 20 words are
counted as if they were 20 words long.
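
A sketch of both length features follows; the exact binning is an assumption
(Table~\ref{tbl:eval} lists 11 word length and 20 sentence length features).
\begin{verbatim}
# Sketch: word and sentence length frequencies, with lengths
# capped at 10 characters and 20 words, respectively.
# sentences is a list of token lists.
def length_features(sentences, max_wlen=10, max_slen=20):
    words = [w for s in sentences for w in s]
    word_hist = [0] * max_wlen
    for w in words:
        word_hist[min(len(w), max_wlen) - 1] += 1
    sent_hist = [0] * max_slen
    for s in sentences:
        sent_hist[min(len(s), max_slen) - 1] += 1
    # Normalize by total word and sentence counts.
    return ([c / max(len(words), 1) for c in word_hist] +
            [c / max(len(sentences), 1) for c in sent_hist])
\end{verbatim}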

All the features suggested in this subsection have weak discriminatory power
on their own. However, they have proven to be very useful in combination with other features, as shown in Table \ref{tbl:eval}.

\section{Classification}
% Representation of documents by real number vectors  enables easy use of
% classifiers that search decision functions, i.e., boundaries in vector
% space.
All the features mentioned in this work use some sort of frequency information,
which makes it possible to represent texts with real-valued feature vectors. Having texts in the form of vectors, we used an SVM (Support Vector Machine) with a radial basis function (RBF) kernel for classification. It has been shown that, with the use of parameter selection, a linear SVM is a special case of an SVM with an RBF kernel \cite{keerthi2003asymptotic}, which removes the need to consider a linear SVM as a separate classifier. The RBF kernel is defined as:
\begin{equation}
k(\mathbf{x_i},\mathbf{x_j})=\exp(-\gamma \|\mathbf{x_i} - \mathbf{x_j}\|^2).
\end{equation}

Before commencing the learning process and classification with the SVM, we scale
the data to ensure an equal contribution of every attribute to the classification. The components of every feature vector are scaled to the interval $[0, 1]$.

For the application of an SVM classifier to practical problems, it is reasonable to
consider an SVM with soft margins, defined by the parameter $C$, as in
\cite{cortes1995support}. The parameter $C$, together with the $\gamma$ used in the RBF kernel, completely defines an SVM model. The search for the appropriate parameters $(C, \gamma)$, i.e., model selection, is done by means of cross-validation: using
5-fold cross-validation on the learning set, the parameters $(C, \gamma)$ yielding
the highest accuracy are selected, and the SVM classifier is trained using them.
The accuracy of classification is measured by the expression:
\begin{equation}
acc = \frac{n_c}{N}, % if problems arise, make this an inline equation
\end{equation}
where $n_c$ is the number of correctly classified texts and $N$ is the total number of
texts.
The parameters $(C, \gamma)$ considered were: $C \in \{2^{-5}, 2^{-4},
\ldots, 2^{15}\}$, $\gamma \in \{2^{-15}, 2^{-14}, \ldots, 2^3\}$ \cite{CC01a}.
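
As a minimal sketch of this procedure (using scikit-learn for illustration;
not necessarily the toolkit used in our experiments):
\begin{verbatim}
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Sketch: [0, 1] scaling + RBF-kernel SVM, with (C, gamma)
# selected by 5-fold cross-validation over the grid above.
param_grid = {
    "svc__C": [2.0 ** e for e in range(-5, 16)],
    "svc__gamma": [2.0 ** e for e in range(-15, 4)],
}
model = make_pipeline(MinMaxScaler(), SVC(kernel="rbf"))
search = GridSearchCV(model, param_grid, cv=5,
                      scoring="accuracy")
# search.fit(X_train, y_train)  # training data assumed given
\end{verbatim}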

\section{Data Set}
\label{sec:podatci}
We have used four different data sets in our experimental evaluation. First we
experimented using an on-line archive of \emph{proofread} articles
from the daily Croatian newspaper ``Jutarnji list,'' available at
\small \url{http://www.jutarnji.hr/komentari/}\normalsize. The data set consists of 4571 texts
written by 25 different authors. The texts are not evenly distributed among
the authors -- the numbers vary from 14 to 1285 texts per author, with an average of 182
texts per author. The articles also differ in size -- the lowest average number
of words per text per author is 315 words, and the highest average is 1347
words. The average number of words per text per author is 717 words. Considering
this analysis, we can conclude that the data set used is very heterogeneous.
Since the topics in these articles tend to be time-specific, to avoid
overfitting we split the set by dates -- the newest 20\% of the articles of each
author are taken for testing (hold-out method).

The second data set that we have used is a collection of on-line blogs. The blog authors were
carefully selected: only authors with more than 40 posts
and posts longer than 500 words were considered. In addition, special attention
was paid to selecting only posts that contain original content, as authors often tend to
quote other sources such as news, books, other authors, etc. As a result, we obtained a data set
consisting of 22 authors with a total of 3662 posts. The
average number of words per post is 846. For testing purposes, we
split the data set by dates using the same technique that was applied to the
articles.

The third data set is comprised of Croatian literature classics. We have
selected 52 novels and novellas by 20 different authors and divided them by
chapters, treating each chapter as a separate text (the same approach was used by
Argamon et al.\ \cite{argamon2005measuring}). All chapters longer than 4000 words
were further divided. This resulted in a total of 1149 texts, with an average
of 2236 words per text. The lowest average number of words per text per
author is 1407 words and the highest average is 3652 words. We randomly
selected 20\% of the texts of each author for the test set, leaving the rest as
the training set.

The last data set used is a set of Internet forum (message board)
posts from \small \url{http://www.forum.hr/}\normalsize. We selected all the posts from the threads in
the \textit{Politics} category. Initially there were 2269 different authors with a
total of 64765 posts. Subsequently we chose only those authors that have on
average more than 100 words per post and more than 64 posts in total. As a
result, a data set consisting of 19 authors and 4026 posts was obtained. Twenty
percent of the newest posts of each author were taken as the test set.


\begin{table*}[htb]
\normalsize\caption{Evaluation of different features using accuracy and $F_1$ measures [\%]}
\scriptsize
\centering
\begin{tabular}{l c c c c c c c}%
\toprule%
Features & Number of & \multicolumn{2}{c}{Newspapers} &
\multicolumn{2}{c}{Blogs} & \multicolumn{2}{c}{Books}\\ & Features &
\hspace{5pt} Acc \hspace{5pt} & \hspace{5pt} $F_1$ \hspace{5pt} & \hspace{5pt} Acc \hspace{5pt} & \hspace{5pt} $F_1$ \hspace{5pt} & \hspace{5pt} Acc \hspace{5pt} & \hspace{5pt} $F_1$ \hspace{5pt}\\
\midrule
Function Words ($\mathcal{F}$) & 652 & 88.39 & 87.38 & 87.21 & 87.08 & 97.24 &
97.12 \\ Idf Weighted Func. Words ($\mathcal{I}$) & 652 & 87.96 & 86.84 & 87.10
& 86.94 & 97.24 & 97.11 \\ Word POS ($\mathcal{C}$) & 8 & 44.50 & 38.18 & 34.95
& 29.85 & 38.97 & 33.63 \\ Punctuation Marks ($\mathcal{P}$) & 11 & 57.50 &
52.93 & 60.64 & 59.35 & 76.55 & 76.31 \\ Vowels ($\mathcal{V}$) & 5 & 30.54 & 16.24 &
25.58 & 21.84 & 42.41 & 39.72 \\ Word Length ($\mathcal{L}$) & 11 & 43.19 &
33.22 & 38.59 & 35.11 & 47.24 & 44.15 \\ Sentence Length ($\mathcal{S}$) & 20 &
40.49 & 33.70 & 28.00 & 21.79 & 25.86 & 20.12 \\ POS \emph{n}-grams, 1st
meth. ($\mathcal{N}_1$) & 512 & 71.29 & 68.68 & 56.67 & 54.05 & 80.34 & 79.83 \\
POS \emph{n}-grams, 2nd meth. ($\mathcal{N}_2$) & 500 & 76.09 & 72.52 & 67.03 &
66.16 & 91.72 & 91.45 \\ Word Morph. Category ($\mathcal{M}$) & 22 & 61.17
& 58.91 & 58.32 & 57.04 & 67.24 & 66.13 \\ $\mathcal{C}$, $\mathcal{M}$ & 30 &
63.17 & 62.06 & 59.65 & 57.98 & 71.72 & 71.84 \\ $\mathcal{P}$, $\mathcal{F}$ &
663 & 92.41 & 91.93 & 88.75 & 88.69 & \textbf{98.62} & \textbf{98.57} \\
 $\mathcal{F}$, $\mathcal{M}$ & 674 & 91.18 & 90.68 & 88.53 & 88.41 & 97.93 &
 97.89 \\ $\mathcal{F}$, $\mathcal{N}_1$ & 1164 & 89.44 & 88.52 & 84.45 & 84.04
 & 97.59 & 97.54 \\ $\mathcal{F}$, $\mathcal{N}_2$ & 1152 & 90.92 & 90.43 & 86.55
 & 86.25 & 97.93 & 97.80 \\ $\mathcal{F}$, $\mathcal{C}$ & 660 & 90.05 & 89.52 &
 88.20 & 87.99 & 97.59 & 97.49 \\ $\mathcal{I}$, $\mathcal{M}$ & 674 & 90.84 &
 90.35 & 88.53 & 88.30 & 97.93 & 97.88 \\ $\mathcal{N}_1$, $\mathcal{M}$ & 534 &
 71.38 & 70.15 & 62.29 & 60.20 & 84.48 & 83.80 \\ $\mathcal{I}$, $\mathcal{M}$,
$\mathcal{C}$ & 682 & 91.36 & 90.89 & 88.31 & 88.16 & 97.93 & 97.81 \\
$\mathcal{P}$, $\mathcal{F}$, $\mathcal{L}$ & 674 & \textbf{93.37} & 92.96 &
90.30 & 90.22 & \textbf{98.62} & \textbf{98.57} \\ $\mathcal{P}$, $\mathcal{F}$, $\mathcal{L}$,
$\mathcal{M}$ & 696 & \textbf{93.46} & \textbf{93.09} & 90.85 & 90.65 & 98.28 &
98.23 \\ $\mathcal{P}$, $\mathcal{F}$, $\mathcal{L}$, $\mathcal{C}$ & 682 &
93.28 & 92.97 & 89.97 & 89.87 & 98.62 & 98.57 \\ $\mathcal{P}$, $\mathcal{F}$,
$\mathcal{L}$, $\mathcal{N}_2$ & 1174 & 90.31 & 89.24 & 87.32 & 87.21 & 98.28 &
98.14 \\ $\mathcal{P}$, $\mathcal{F}$, $\mathcal{L}$, $\mathcal{M}$,
$\mathcal{C}$ & 704 & \textbf{93.46} & \textbf{93.15} & 90.85 & 90.73 &
\textbf{98.62} & \textbf{98.58} \\ $\mathcal{P}$, $\mathcal{F}$, $\mathcal{L}$, $\mathcal{M}$,
$\mathcal{N}_2$ & 1196 & 91.54 & 90.52 & 88.09 & 87.95 & 98.28 & 98.14 \\
$\mathcal{P}$, $\mathcal{F}$, $\mathcal{L}$, $\mathcal{M}$, $\mathcal{C}$,
$\mathcal{N}_2$ & 1204 & 91.80 & 90.88 & 88.64 & 88.49 & 97.93 & 97.81 \\
$\mathcal{P}$, $\mathcal{F}$, $\mathcal{V}$, $\mathcal{L}$ & 679 & 93.37 &
93.04 & 90.19 & 90.11 & 98.62 & 98.57 \\ $\mathcal{P}$, $\mathcal{C}$, $\mathcal{N}_2$,
$\mathcal{M}$ & 541 & 85.42 & 84.11 & 73.87 & 73.06 & 94.14 & 94.02 \\
$\mathcal{F}$, $\mathcal{M}$, $\mathcal{C}$, $\mathcal{N}_1$ & 1194 & 89.62 &
88.70 & 86.11 & 85.69 & 98.62 & 98.57 \\ $\mathcal{S}$, $\mathcal{P}$, $\mathcal{F}$,
$\mathcal{L}$ & 694 & 92.67 & 92.18 & \textbf{91.29} & \textbf{91.23} &
\textbf{99.66} & \textbf{99.65} \\ $\mathcal{S}$, $\mathcal{P}$,
$\mathcal{N}_2$, $\mathcal{L}$ & 542 & 83.33 & 81.68 & 73.98 & 73.31 & 95.17 &
95.06 \\ $\mathcal{S}$, $\mathcal{P}$, $\mathcal{F}$, $\mathcal{V}$,
$\mathcal{L}$ & 699 & 93.19 & 92.85 & 91.18 & 91.10 & 98.62 & 98.58 \\
$\mathcal{S}$, $\mathcal{P}$, $\mathcal{F}$, $\mathcal{V}$, $\mathcal{L}$,
$\mathcal{M}$ & 721 & 93.19 & 92.88 & \textbf{91.51} & \textbf{91.38} &
\textbf{98.62} & \textbf{98.57} \\ $\mathcal{S}$, $\mathcal{P}$, $\mathcal{F}$, $\mathcal{V}$,
$\mathcal{L}$, $\mathcal{M}$, $\mathcal{N}_1$ & 1233 & 92.41 & 91.96 & 89.08 &
88.71 & 98.28 & 98.23 \\ $\mathcal{S}$, $\mathcal{P}$, $\mathcal{F}$,
$\mathcal{V}$, $\mathcal{L}$, $\mathcal{M}$, $\mathcal{N}_1$, $\mathcal{C}$ &
1241 & 92.41 & 91.96 & 89.75 & 89.44 & 98.28 & 98.23 \\
\bottomrule
\end{tabular}
\label{tbl:eval}
\end{table*}

\section{Evaluation}
\label{sec:evaluacija}
Classifier quality is measured as the ratio of correctly classified texts to
the total number of texts in the test set, i.e., micro-averaged accuracy.
First we tested all the features separately, and then we tested various combinations of them. The results of the evaluation on articles, blogs, and books
are shown in Table \ref{tbl:eval}.

Total accuracy alone does not explain the behavior of a classifier for every class, so precision and recall were calculated for each class and then
used to calculate the total macro $F_1$ measure.
% For particular class $c$,
% precision is the ratio of the number of correctly classified documents in $c$ and
% the number of all documents which are classified as $c$. Recall is the ratio of the
% number of correctly classified documents in $c$ and the number of all the
% documents truly in $c$. $F_1$ measure is calculated for every class $c_i$
% according to the following expression:
% \begin{equation}
% F_i = \frac{2 \cdot precision_i \cdot recall_i}{precision_i + recall_i}
% \end{equation}
% where $precision_i$ and $recall_i$ are measures of precision and recall for
% the class $c_i$.
% 
% Total macro $F_1$ measure is calculated by the following expression:
% \begin{equation}
% F_w = \frac{\sum^{n}_i |c_i|\cdot F_i}{\sum^n_i|c_i|}
% \end{equation}
% where $n$ is total number of classes, $|c_i|$ number of documents in the class
% $c_i$ and $F_i$ an $F_1$ measure for class $c_i$.
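
For reference, a minimal sketch of both evaluation measures; the weighted macro
$F_1$ mentioned above is assumed here to correspond to scikit-learn's
class-size-weighted average.
\begin{verbatim}
from sklearn.metrics import accuracy_score, f1_score

# Sketch: micro-averaged accuracy and class-size-weighted F1.
# y_true, y_pred: gold and predicted author labels.
y_true = ["a", "a", "b"]            # toy example
y_pred = ["a", "b", "b"]
acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average="weighted")
\end{verbatim}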

The SVM parameter selection is very time consuming; thus, not all of the
feature combinations were tested. We focused on the evaluation of feature
combinations that are based on the syntactic analysis of the Croatian language and
those that are based on function word frequencies, as they have proven to be the most
successful. Furthermore, some of the features, such as function words and idf
weighted function words, are very similar, and the evaluation of their combination
would show no considerable improvement. Also, certain feature combinations made
no significant contributions, so we did not evaluate them further.

As we can see from Table \ref{tbl:eval}, the highest accuracies are achieved by
using combinations of simple features such as function word, punctuation mark,
word length, and sentence length frequencies ($\mathcal{F}$, $\mathcal{P}$, $\mathcal{L}$, $\mathcal{S}$). On the other hand, the combinations of syntax-based features are slightly less accurate. If we use the combination of function words and syntax-based features, accuracy remains nearly the same as without the syntax-based features. We can also see that features based on function word frequencies achieve excellent results on different data sets, regardless
of the text form, as long as the data set is comprised of sufficiently long
texts. Classification results improve as the average text length increases. To further evaluate the impact of text
length on the performance of the used features, we performed experiments on much
shorter texts, using the forum posts data set, but the results were not satisfactory:
the highest accuracy, 27\%, was achieved using punctuation mark frequencies.

\section{Conclusion}
We have shown that the authorship attribution problem, when applied to
morphologically complex languages such as Croatian, can be
successfully solved using a combination of relatively simple features. Our
results of 91\%, 93\%, and 99\% accuracy on texts with an average length above 700 words are quite notable considering the fact
that different heterogeneous data sets were used. Evaluation on the different data
sets shows that the same feature combination, based on function word,
punctuation mark, word length, and sentence length frequencies, achieves the
highest results. Therefore, we conclude that those
features are the most suitable for the task of authorship
attribution. Experimental results for short texts indicate that more complex features should be utilized. In the similar work of Uzuner et al.\ \cite{uzuner2005comparative},
on a smaller set of authors, considerably larger texts, and with
different classifiers, nearly the same results and conclusions were obtained. A
direct comparison of the results has proven very difficult due to the different types of data sets used and the different numbers of authors considered. However, our results fall within the interval of previously reported results, which range from 70\% to 99\%
\cite{argamon2005measuring,coyotl2006authorship,keselj2003n,luyckx2005shallow,stamatatos2001computer,uzuner2005comparative}.

In addition, there are no other reported methods or results for the Croatian
language, nor for any of the related South Slavic languages, so our work presents a
basis for further research. In future work, homography should be resolved in order to obtain more accurate syntax-based features. As suggested in \cite{coyotl2006authorship,keselj2003n,peng2003language}, features based on word and character \emph{n}-grams should be compared to the features used in this work. A comparison with semantics-based features should be conducted as well. It would also be interesting to investigate features that are able to cope with shorter texts, such as Internet forum posts, e-mails, on-line conversations, or text messages.

\section*{Acknowledgement}
This research has been supported by the Croatian Ministry of Science, Education
and Sports under the grant No.~036-1300646-1986. The authors would like to thank Bojana Dalbelo Ba\v{s}i\'c, Jan \v{S}najder, and the reviewers for their helpful suggestions.

\bibliographystyle{mysplncs}
\bibliography{literatura}

\end{document}
